Only start Journald input with supported systemd versions #39605

Closed

Conversation

belimawr
Contributor

@belimawr belimawr commented May 16, 2024

Proposed commit message

Systemd/Journald has a bug that will cause Filebeat to be killed by a SIGBUS when reading from rotated logs. This bug is fixed in Systemd v255.

This commit checks the Systemd version when a Journald input is instantiated. If the version is not supported, the input creation fails. A warning was added to the documentation stating the minimum supported version of Systemd.

It is possible to disable the Systemd version check by passing the --ignore-journald-version CLI flag.
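The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `checkSystemdVersion`, `minSupportedSystemdVersion`, and `ignoreVersionCheck` are hypothetical. Only the minimum version (255) and the error text are taken from this description.

```go
package main

import "fmt"

// minSupportedSystemdVersion is the first Systemd release containing the fix,
// according to the PR description. The constant name is hypothetical.
const minSupportedSystemdVersion = 255

// checkSystemdVersion is a hypothetical helper: it returns an error when the
// detected Systemd major version is older than the minimum supported one.
// ignoreVersionCheck stands in for the --ignore-journald-version CLI flag.
func checkSystemdVersion(version int, ignoreVersionCheck bool) error {
	if ignoreVersionCheck {
		// The user explicitly opted out of the check.
		return nil
	}
	if version < minSupportedSystemdVersion {
		return fmt.Errorf(
			"systemd version must be >= %d. Systemd version: %d",
			minSupportedSystemdVersion, version,
		)
	}
	return nil
}
```

Under these assumptions, the input constructor would call the helper once and return its error, so the input is never created on an unsupported host.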

An Ubuntu 2204 Vagrant Box is added for testing.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Filebeat will refuse to start if a Journald input is configured and the host is running an unsupported version of Journald.

There should be no disruptions for the current users because the Journald input is still in technical preview.

When running Elastic-Agent and using the "Custom Journald logs" integration, there is no way to disable the current Systemd version check.

Author's Checklist

  • We can test the Journald input on CI
  • Any problems with adding the fork system call?
  • Do we want to use D-Bus?
  • Define the desired behaviour under Elastic-Agent

How to test this PR locally

  1. Start two VMs: Ubuntu 2204 and Archlinux

    vagrant up ubuntu2204
    vagrant up arch
    
  2. SSH into the Archlinux VM and update the system to get the fixed version of Systemd and install other dependencies

    sudo pacman -Sy archlinux-keyring --noconfirm
    sudo pacman -Syu  --noconfirm
    sudo pacman -Sy base-devel --noconfirm
    sudo reboot
    # ssh into the VM again, install Go
    cd /tmp
    curl -L https://go.dev/dl/go`cat /vagrant/.go-version`.linux-amd64.tar.gz -o go.tar.gz
    sudo tar -C /usr/local -xzf ./go.tar.gz
    export PATH=$PATH:/usr/local/go/bin
    
  3. Run the tests

     cd /vagrant
     go test -tags=withjournald,cgo,linux ./filebeat/input/journald/
    
  4. Build Filebeat with Journald support (you can do this from within the VM):

    cd /vagrant/filebeat
    go build -tags="cgo,linux,withjournald" .
    
  5. Run Filebeat in each VM: ./filebeat -e -v using the following filebeat.yml

    filebeat.yml

    filebeat.inputs:
      - type: journald
        id: journald-input
    
    output.file:
      path: ${path.home}
      filename: output
      codec.json:
        pretty: true

  6. Assert that:

    • Filebeat starts and runs successfully in the Archlinux VM
    • Filebeat fails to start in the Ubuntu 2204 VM with the following error message:
      {"log.level":"error","@timestamp":"2024-05-16T19:28:17.850Z","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cmd/instance.handleError","file.name":"instance/beat.go","file.line":1345},"message":"Exiting: Failed to start crawler: starting input failed: error while initializing input: systemd version must be >= 255. Systemd version: 249","service.name":"filebeat","ecs.version":"1.6.0"}
      Exiting: Failed to start crawler: starting input failed: error while initializing input: systemd version must be >= 255. Systemd version: 249
      

Related issues

Use cases

Screenshots

When running under Elastic-Agent on hosts with Systemd < v255, the Journald input will not start and will correctly report the error/reason.

Screenshots from various Linux distributions

2024-06-04_14-23
2024-06-04_14-24
2024-06-04_14-24_1
2024-06-04_14-26

Logs

Filebeat will fail to start with:

{"log.level":"error","@timestamp":"2024-05-16T19:28:17.850Z","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cmd/instance.handleError","file.name":"instance/beat.go","file.line":1345},"message":"Exiting: Failed to start crawler: starting input failed: error while initializing input: systemd version must be >= 255. Systemd version: 249","service.name":"filebeat","ecs.version":"1.6.0"}
Exiting: Failed to start crawler: starting input failed: error while initializing input: systemd version must be >= 255. Systemd version: 249

@belimawr belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label May 16, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels May 16, 2024
Contributor

mergify bot commented May 16, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@belimawr belimawr marked this pull request as ready for review May 16, 2024 19:33
@belimawr belimawr requested a review from a team as a code owner May 16, 2024 19:33
@belimawr belimawr requested review from AndersonQ and rdner May 16, 2024 19:33
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Contributor

mergify bot commented May 17, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b journald-causes-filebeat-to-panic-34077 upstream/journald-causes-filebeat-to-panic-34077
git merge upstream/main
git push upstream journald-causes-filebeat-to-panic-34077

@rdner rdner removed their request for review May 21, 2024 12:41
@belimawr belimawr force-pushed the journald-causes-filebeat-to-panic-34077 branch from 7a1460c to 32d12e3 Compare May 23, 2024 13:54
@belimawr belimawr marked this pull request as draft May 23, 2024 13:55
@belimawr
Contributor Author

I converted this PR to draft because we still need to decide whether we want to move forward with D-Bus and what the best way to test it is.

@belimawr
Contributor Author

The Systemd D-Bus documentation clearly states the version property should not be parsed because its scheme may change and it is not part of the API.

Version encodes the version string of the running systemd instance. Note that the version string is purely informational, it should not be parsed, one may not assume the version to be formatted in any particular way. We take the liberty to change the versioning scheme at any time and it is not part of the API.

However, it seems to be stable enough among the versions I manually tested:

  • Archlinux
  • Ubuntu 2204
  • AmazonLinux2

So I believe it is safe to move forward using it.

It does add the D-Bus dependency to the Journald input, which I believe is not a problem for running the input, but we will need to make some changes in CI to run the tests.

@belimawr belimawr added the backport-skip Skip notification from the automated backport with mergify label May 23, 2024
@belimawr belimawr force-pushed the journald-causes-filebeat-to-panic-34077 branch from c959999 to d5e4f5d Compare May 23, 2024 20:06
@belimawr belimawr force-pushed the journald-causes-filebeat-to-panic-34077 branch from 5f1e184 to c9e178c Compare June 3, 2024 20:34
@belimawr belimawr marked this pull request as ready for review June 3, 2024 21:09
@belimawr belimawr requested review from cmacknz and leehinman June 3, 2024 21:09
@belimawr belimawr marked this pull request as draft June 4, 2024 16:42
@belimawr
Contributor Author

belimawr commented Jun 4, 2024

I'm putting this PR back in draft because, when I started testing the behaviour under Elastic-Agent, I found out that the D-Bus package we're using in Beats requires dbus-launch, which is not available in some (or most) VMs 🤦‍♂️.

I'm investigating whether github.com/coreos/go-systemd/dbus can provide the functionality we need.

@belimawr belimawr force-pushed the journald-causes-filebeat-to-panic-34077 branch from 9159c8d to 1b77913 Compare June 4, 2024 17:56
@belimawr belimawr marked this pull request as ready for review June 4, 2024 17:56
@belimawr
Contributor Author

belimawr commented Jun 4, 2024

Using github.com/coreos/go-systemd/v22/dbus instead of github.com/godbus/dbus/v5 solved the problem. It was quicker than I expected. I'm doing some manual testing with some VMs to be on the safe side.

Contributor

mergify bot commented Jun 6, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b journald-causes-filebeat-to-panic-34077 upstream/journald-causes-filebeat-to-panic-34077
git merge upstream/main
git push upstream journald-causes-filebeat-to-panic-34077

Member

@AndersonQ AndersonQ left a comment

the blocker is just the typo
apart from that, there is a question for you to answer

Comment on lines +376 to +385
// getSystemdVersionViaDBus gets the Systemd version from sd-bus
//
// The Systemd D-Bus documentation states:
//
// Version encodes the version string of the running systemd
// instance. Note that the version string is purely informational,
// it should not be parsed, one may not assume the version to be
// formatted in any particular way. We take the liberty to change
// the versioning scheme at any time and it is not part of the API.
// Source: https://www.freedesktop.org/wiki/Software/systemd/dbus/
Member

[Question]
Then why use D-Bus to get the version? Is it the most generic way of doing that across all Linux distros?

func systemdVersion() (int, error) {
versionStr, err := getSystemdVersionViaDBus()
if err != nil {
return 0, fmt.Errorf("caanot get Systemd version: %w", err)
Member

Typo

Suggested change
return 0, fmt.Errorf("caanot get Systemd version: %w", err)
return 0, fmt.Errorf("cannot get Systemd version: %w", err)

Comment on lines +193 to +204
// TestJournald executes the Journald input tests
// Use TEST_COVERAGE=true to enable code coverage profiling.
// Use RACE_DETECTOR=true to enable the race detector.
func TestJournaldInput(ctx context.Context) error {
utArgs := devtools.DefaultGoTestUnitArgs()
utArgs.Packages = []string{"../../filebeat/input/journald"}
if devtools.Platform.GOOS == "linux" {
utArgs.ExtraFlags = append(utArgs.ExtraFlags, "-tags=withjournald")
}

return devtools.GoTest(ctx, utArgs)
}
Member

Isn't there a way to avoid duplicating that?

// The function will parse and return the 3 digit major version, minor version
// and patch are ignored.
func parseSystemdVersion(ver string) (int, error) {
re := regexp.MustCompile(`(v)?(?P<version>\d\d\d)(\.)?`)
Contributor

The 3 digits make me a little nervous: if systemd goes to 4 digits, this all breaks.

^(?P<version>\d{3,}) works with the examples provided. thoughts?
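A minimal sketch of how the suggested pattern might be applied, assuming the input string has already been stripped of any surrounding quotes. The optional `v` prefix is carried over from the regex under review and is an assumption, as are the error messages. This is an illustration, not the code in this PR.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// systemdVersionRe matches a leading major version of three or more digits,
// optionally prefixed with "v". Everything after the digits is ignored.
var systemdVersionRe = regexp.MustCompile(`^v?(?P<version>\d{3,})`)

// parseSystemdVersion returns the major version from strings such as
// "255.6-1-arch" or "249.11-0ubuntu3.12". Minor version and patch are ignored.
func parseSystemdVersion(ver string) (int, error) {
	m := systemdVersionRe.FindStringSubmatch(ver)
	if m == nil {
		return 0, fmt.Errorf("unsupported Systemd version format %q", ver)
	}
	// Look the group up by name so the index stays correct if the
	// pattern gains or loses groups later.
	v, err := strconv.Atoi(m[systemdVersionRe.SubexpIndex("version")])
	if err != nil {
		return 0, fmt.Errorf("cannot convert Systemd version %q to int: %w", ver, err)
	}
	return v, nil
}
```

With `\d{3,}` a future four-digit major version would still parse, while a one- or two-digit string is rejected instead of being silently misread.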

@belimawr
Contributor Author

@AndersonQ, @leehinman thank you so much for the reviews! I was discussing this with Craig and, due to other issues and inconsistencies in versions and the behaviour/crashes we face when calling libsystemd on the Systemd versions across the supported Linux distributions, we decided to call journalctl directly, so this PR is not needed any more.

OpenTelemetry calls journalctl directly and in my tests it does not face the issues/crashes we're experiencing.

So I'll close this PR in favour of #39820 that will be a real fix for #34077 instead of just a mitigation.

@belimawr belimawr closed this Jun 10, 2024
Labels
backport-skip Skip notification from the automated backport with mergify Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Filebeat] Journald causes Filebeat to crash
5 participants